Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically important segments (e.g., human actions) from untrimmed videos is a key step for large-scale video analysis. We propose a novel Temporal Unit Regression Network (TURN) model. There are two salient aspects of TURN: (1) TURN jointly predicts action proposals and refines their temporal boundaries by temporal coordinate regression; (2) fast computation is enabled by unit feature reuse: a long untrimmed video is decomposed into video units, which are reused as basic building blocks of temporal proposals. TURN outperforms state-of-the-art methods under average recall (AR) by a large margin on the THUMOS-14 and ActivityNet datasets, and runs at over 880 frames per second (FPS) on a TITAN X GPU. We further apply TURN as the proposal generation stage of existing temporal action localization pipelines, where it outperforms the state of the art on THUMOS-14 and ActivityNet.
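The two ideas named in the abstract, unit feature reuse and temporal coordinate regression, can be illustrated with a minimal sketch. This is not the authors' implementation: the unit size of 16 frames, the use of mean pooling, expressing offsets in units, and all function names (`split_into_units`, `pool_units`, `refine`) are assumptions made for illustration. The point is only that per-unit features are computed once and then shared by every overlapping proposal, and that a regressor's output shifts proposal boundaries.

```python
from typing import List, Sequence, Tuple

UNIT_SIZE = 16  # frames per unit; an assumed value, not stated in the abstract


def split_into_units(num_frames: int, unit_size: int = UNIT_SIZE) -> List[Tuple[int, int]]:
    """Return non-overlapping [start, end) frame ranges; a trailing partial unit is dropped."""
    return [(s, s + unit_size) for s in range(0, num_frames - unit_size + 1, unit_size)]


def pool_units(unit_feats: Sequence[Sequence[float]], start: int, end: int) -> List[float]:
    """Mean-pool cached features of units start..end-1 into one proposal feature.

    Mean pooling is an assumption here; the key property is that unit_feats
    is computed once per video and reused by every proposal.
    """
    span = unit_feats[start:end]
    if not span:
        raise ValueError("proposal must cover at least one unit")
    dim = len(span[0])
    return [sum(f[d] for f in span) / len(span) for d in range(dim)]


def refine(start_unit: float, end_unit: float,
           start_offset: float, end_offset: float,
           num_units: int) -> Tuple[float, float]:
    """Shift proposal boundaries by regressed offsets (in units) and clamp to the video."""
    new_start = min(max(start_unit + start_offset, 0.0), float(num_units))
    new_end = min(max(end_unit + end_offset, new_start), float(num_units))
    return new_start, new_end


# Toy usage: a 100-frame video yields 6 units; the features are stand-ins
# for what a real feature extractor would produce once per unit.
units = split_into_units(100)
unit_feats = [[float(i), 2.0 * i] for i in range(len(units))]

# Two overlapping proposals share units 2 and 3 without recomputing them.
feat_a = pool_units(unit_feats, 1, 4)
feat_b = pool_units(unit_feats, 2, 5)

# Hypothetical regressor outputs (-0.5, +0.25) refine the first proposal.
refined = refine(1, 4, -0.5, 0.25, len(units))
```

In this sketch the expensive step (feature extraction) scales with the number of units rather than the number of proposals, which is the kind of reuse the abstract credits for fast computation.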